Stochastic Contextual Bandits with Long Horizon Rewards

Authors

Abstract

The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most s prior actions and contexts (not necessarily consecutive), up to a time horizon of h. In order to avoid polynomial dependence on h, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor (T ≤ h) and data-rich (T ≥ h) regimes and derive respective regret upper bounds Õ(d√(sT) + min{q, T}) and Õ(√(sdT)), with sparsity s, feature dimension d, total horizon T, and q that is adaptive to the sparsity pattern. Complementing these upper bounds, we also show that learning from a single trajectory brings inherent challenges: while the pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse structure. Our results necessitate a new analysis to address long-range temporal dependencies across the data. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish guarantees of independent interest.
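To make the reward model concrete, the following is a minimal simulation sketch of the setting the abstract describes: each round's reward depends on the inner products ⟨θ, x⟩ of the chosen-action features from at most s of the last h rounds, selected by an unknown sparse pattern. All names (`w`, `theta`, the random placeholder policy, the noise level) are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, s, K, T = 5, 50, 3, 4, 200  # feature dim, horizon, sparsity, arms, rounds

# Hypothetical sparse dependence pattern over the last h rounds: only s
# entries of w are nonzero, i.e. only s past rounds influence today's reward.
support = rng.choice(h, size=s, replace=False)
w = np.zeros(h)
w[support] = rng.normal(size=s)

# Unknown arm parameter shared across rounds (unit norm for scale).
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)

feats, rewards = [], []
for t in range(T):
    contexts = rng.normal(size=(K, d))  # fresh contexts each round
    a = rng.integers(K)                 # placeholder policy: uniform random
    feats.append(contexts[a])

    # Stack the chosen features from the last h rounds, most recent first,
    # zero-padded at the start of the trajectory.
    window = np.array(feats[max(0, t - h + 1): t + 1])[::-1]
    window = np.vstack([window, np.zeros((h - len(window), d))])

    # Reward mixes the s relevant past inner products, plus noise.
    r = w @ (window @ theta) + 0.1 * rng.normal()
    rewards.append(r)

print(len(rewards), round(float(np.mean(rewards)), 3))
```

A learner, unlike this random placeholder policy, would have to estimate w and θ jointly from the single observed trajectory, which is where the rank-1 structure and the circulant-matrix isometry analysis enter.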


Similar Articles

Contextual Bandits with Stochastic Experts

We consider the problem of contextual bandits with stochastic experts, which is a variation of the traditional stochastic contextual bandit with experts problem. In our problem setting, we assume access to a class of stochastic experts, where each expert is a conditional distribution over the arms given a context. We propose upper-confidence bound (UCB) algorithms for this problem, which employ...


Nonparametric Stochastic Contextual Bandits

We analyze the K-armed bandit problem where the reward for each arm is a noisy realization based on an observed context under mild nonparametric assumptions. We attain tight results for top-arm identification and a sublinear regret of Õ(T^((1+D)/(2+D))), where D is the context dimension, for a modified UCB algorithm that is simple to implement (kNN-UCB). We then give global intrinsic dimension dep...
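The kNN-UCB idea named in this snippet is simple enough to sketch: estimate an arm's reward at a context by averaging its k nearest past observations, and add an optimistic bonus. This is an illustrative toy (the reward functions, bonus constants, and scalar contexts are assumptions, not the cited paper's exact algorithm).

```python
import numpy as np

rng = np.random.default_rng(1)
K, T, k = 3, 500, 10

# Hypothetical smooth reward functions of a scalar context, one per arm.
funcs = [lambda x: x, lambda x: 1 - x, lambda x: 0.5 + 0.0 * x]
hist = [([], []) for _ in range(K)]  # per-arm (contexts, rewards)

def knn_ucb(arm, x, t):
    xs, rs = hist[arm]
    if len(xs) < k:
        return np.inf  # force initial exploration of each arm
    dist = np.abs(np.array(xs) - x)
    idx = np.argsort(dist)[:k]
    # Bonus: confidence term plus the radius of the k-NN ball (bias term).
    bonus = np.sqrt(2 * np.log(t + 1) / k) + dist[idx].max()
    return np.mean(np.array(rs)[idx]) + bonus

total = 0.0
for t in range(T):
    x = rng.random()
    a = int(np.argmax([knn_ucb(arm, x, t) for arm in range(K)]))
    r = funcs[a](x) + 0.1 * rng.normal()
    hist[a][0].append(x)
    hist[a][1].append(r)
    total += r

print(round(total / T, 3))
```

The nearest-neighbor average replaces the linear-parameter estimate of standard UCB, which is what makes the method nonparametric; the regret scales with the context dimension D as in the bound above.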


Stochastic Contextual Bandits with Known Reward Functions

Many sequential decision-making problems in communication networks such as power allocation in energy harvesting communications, mobile computational offloading, and dynamic channel selection can be modeled as contextual bandit problems which are natural extensions of the well-known multi-armed bandit problem. In these problems, each resource allocation or selection decision can make use of ava...


Linear Contextual Bandits with Knapsacks

We consider the linear contextual bandit problem with resource consumption, in addition to reward generation. In each round, the outcome of pulling an arm is a reward as well as a vector of resource consumptions. The expected values of these outcomes depend linearly on the context of that arm. The budget/capacity constraints require that the total consumption doesn’t exceed the budget for each ...


Contextual Bandits with Similarity Information

In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now well-understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs t...



Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i8.26140